[SPARK-11994][Mllib] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max #9989

Closed
tmnd1991 wants to merge 8 commits into apache:master from tmnd1991:SPARK-11932

Conversation

@tmnd1991

Contributor

No description provided.

@tmnd1991 tmnd1991 changed the title [Spark 11932] [Spark-11932][Mllib] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max Nov 26, 2015
@tmnd1991 tmnd1991 changed the title [Spark-11932][Mllib] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max [SPARK-11932][Mllib] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max Nov 26, 2015

Member


1 << 25? Can you explain the rationale in a quick comment here, and where the 41 comes from?

@tmnd1991

Contributor Author

I explained it on the JIRA issue; I'll explain it again here:
Since spark.kryoserializer.buffer.max defaults to 64MB, I decided to size the partitions the model gets divided into at half that value (32MB).
One word2vec entry consists of a float array of size vectorSize and a string. Since the size of the string is variable and considerably smaller than that of the array, I'm not going to consider it in my size computation.
The number of partitions the model gets split into is given by the formula:
(4 * numWords * vectorSize / 33554432) + 1
Where 4 is the size of a float in bytes, numWords and vectorSize are respectively the number of words the model contains and the size of each array, and 33554432 is 32MB in bytes.
A more sophisticated solution would be to read the spark.kryoserializer.buffer.max value at runtime, but that would be fairly meaningless, because when we save the model we can't be sure the property will have the same value when we load it. It could be a different Spark application.
I don't get the 1 << 25? question.
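
As a minimal sketch of the computation described above, the formula could be written in Scala roughly as follows. The object name PartitionEstimate, the method numPartitions, and the constant names are hypothetical and are not taken from the patch; only the formula and the 32MB target come from the comment.

```scala
// Illustrative sketch only: names are hypothetical, not those used in the patch.
object PartitionEstimate {
  // Target partition size: 32MB, half the 64MB default of
  // spark.kryoserializer.buffer.max mentioned in the comment.
  val TargetPartitionBytes: Long = 33554432L

  // Size of one float in bytes.
  val FloatBytes: Long = 4L

  // Approximate number of partitions for a model with numWords vectors of
  // vectorSize floats each. The word strings are ignored, as in the comment.
  // The arithmetic is done in Long so that large models do not overflow Int.
  def numPartitions(numWords: Int, vectorSize: Int): Int =
    ((FloatBytes * numWords * vectorSize / TargetPartitionBytes) + 1).toInt
}
```

Under these assumptions, a small model of 1,000 words with 100-dimensional vectors stays in a single partition, while a model of 3,000,000 words with 300-dimensional vectors (about 3.6GB of floats) would be split into 108 partitions.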

@srowen

srowen commented Nov 26, 2015

Member

@tmnd1991 Yes, I know. I'm suggesting you explain it not in the JIRA or PR, but briefly as comments in the code.

1 << 25 is also 32MB in that it's 2^25. It seems like less of a magic number than "33554432". (Especially since you made it a long but with an 'l' instead of 'L', which makes it look like the number 335544321.) Up to your taste though.
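
To make the equivalence concrete, here is a small Scala sketch. The object and value names are hypothetical and only illustrate the two points above: the shift expression and the uppercase long suffix.

```scala
// Illustrative sketch only: names are hypothetical.
object ShiftConstant {
  // 1 << 25 shifts 1 left by 25 bits, giving 2^25 = 32 * 1024 * 1024 bytes (32MB).
  val ThirtyTwoMB: Int = 1 << 25

  // The same value as a Long. The uppercase 'L' suffix cannot be mistaken
  // for the digit 1, unlike a lowercase 'l'.
  val ThirtyTwoMBLong: Long = 1L << 25
}
```

Both values equal 33554432; the shift form simply makes the power of two visible in the code.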

@tmnd1991

Contributor Author

I adjusted the code as you suggested.

@srowen

srowen commented Nov 28, 2015

Member

This is tagged to the wrong JIRA, right? It's SPARK-11994.

@tmnd1991 tmnd1991 changed the title [SPARK-11932][Mllib] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max [SPARK-11934][Mllib] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max Nov 28, 2015
@tmnd1991

Contributor Author

Yes, it was wrong, I fixed it, sorry.

@SparkQA

SparkQA commented Nov 28, 2015


Test build #2126 has finished for PR 9989 at commit a2f6b0b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tmnd1991

Contributor Author

I adjusted the code style.

@srowen

srowen commented Nov 28, 2015

Member

(The title is still wrong, it's SPARK-11994)

@SparkQA

SparkQA commented Nov 28, 2015


Test build #2127 has finished for PR 9989 at commit fdbe2a7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tmnd1991 tmnd1991 changed the title [SPARK-11934][Mllib] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max [SPARK-11994][Mllib] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max Nov 28, 2015
@tmnd1991

Contributor Author

Something went wrong with the commit.
It should be fine now. Never commit before a good coffee!

@srowen

srowen commented Nov 28, 2015

Member

@MechCoder @rxin any thoughts on this one? It looks reasonable unless there's some reason the model has to be in 1 partition.

@SparkQA

SparkQA commented Dec 1, 2015


Test build #2134 has finished for PR 9989 at commit e286e66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen

srowen commented Dec 2, 2015

Member

Checking with @mengxr @jkbradley too, just in case.

@srowen

srowen commented Dec 5, 2015

Member

Merged to master

@asfgit asfgit closed this in e9c9ae2 Dec 5, 2015
3 participants